Lecture 2: Project management, version control, and the shell

EC7412 Part II: Data Science for Economists

Adam Altmejd Selder

Swedish Institute for Social Research (SOFI)

April 11, 2025

Project management

  • Project management

  • Short shell intro

  • Version control

  • Basic Git workflow

  • Merge conflicts

  • Branches

Project management

Software install check (Problem set 0)

Did you…

You will need it to follow along today

Project management

Why?

  • Replication is key
  • Running code should give same results, now and later
  • Without replication, results are not credible
  • Much easier if considered from the start
  • Your collaborators (and future self) will thank you!

Project management

Warning signs

  • You can’t find the file you are looking for
  • You’re not sure if your data has been processed
    • Have outliers been removed?
  • You have data/script files in your home/download folder
    • What project is summarize_earnings.R for?
  • You don’t know if you are working with the latest version (of a script or text)
  • You and your colleagues are working on different versions

Project management

Good code makes replication straightforward

  • Like math (but unlike words), good code provides exact documentation
  • How was income calculated? Were outliers removed? What was done with missing or invalid observations?
  • Replication needs to be in the back of your mind constantly

Project management

File and folder structure

  • One (self-contained) folder per project
  • Use relative paths
  • Never change raw data, process data to new file(s)
  • Consistent and careful naming
  • Split code into multiple files by feature
  • Document everything, describe project in README.md file
  • Have a backup strategy
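
As a minimal sketch of such a layout, the commands below create a project skeleton in a throwaway directory. The project name `my_project` and the folder names `data/raw`, `data/processed`, `code`, and `output` are illustrative assumptions, not a prescribed standard.

```shell
set -e
# Work in a throwaway directory so none of your real files are touched.
scratch=$(mktemp -d)
cd "$scratch"

# Hypothetical project name and folder layout; adapt to your own project.
mkdir -p my_project/data/raw my_project/data/processed \
         my_project/code my_project/output
cd my_project

# A README.md in the project root describes the project.
printf '# my_project\n\nDescribe the project and how to run the code here.\n' > README.md

# List everything that was created.
find . | sort
```

Everything lives under one folder, so the project can be moved, shared, or backed up as a unit.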

 

Project management

Use self-contained project folders

  • Each project has its own folder (named “project_name”)
  • Perhaps in a parent folder called “projects”
  • All project files are in this folder
  • Self-contained: does not reference any files outside folder

Why?

  • Portable
  • Easy to back up
  • Easy to share

Project management

Use relative paths and start new sessions

Many do-files start like this:

clear all
cd "C:/my_project/analysis"
use "mydata.dta"

or like this:

clear all
use "C:/my_project/analysis/mydata.dta"

Don’t do this! Absolute paths only work on your own machine.

  • Use relative paths
  • Start a new session for each script

use "mydata.dta"
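
The same idea can be sketched in the shell rather than Stata. The example below assumes a hypothetical project folder `my_project` containing one data file, and shows that a relative path keeps working after the whole folder is moved.

```shell
set -e
scratch=$(mktemp -d)
cd "$scratch"

# Hypothetical project with one data file.
mkdir -p my_project/data
echo "id,income" > my_project/data/mydata.csv

# From the project root, a relative path finds the file.
(cd my_project && head -n 1 data/mydata.csv)

# Move the whole project somewhere else (think: a collaborator's machine).
mkdir elsewhere
mv my_project elsewhere/

# The same relative path still works from the new project root.
# An absolute path to the old location would now be broken.
(cd elsewhere/my_project && head -n 1 data/mydata.csv)
```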

Project management

Data management

  • Keep raw data in data/raw folder
  • Never edit it, use write-protection
  • Save processed data to new files
  • Processed data should be:
    • Tidy
    • Deduplicated
    • Carefully named
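
A minimal sketch of this routine, assuming a hypothetical file `data/raw/survey.csv`: the raw file is write-protected with `chmod`, and the deduplicated version goes to a new file. How strictly the write-protection is enforced depends on your system, and it may behave differently under Git Bash on Windows.

```shell
set -e
scratch=$(mktemp -d)
cd "$scratch"

# Hypothetical raw data file with one duplicated row.
mkdir -p data/raw data/processed
printf 'id,income\n1,100\n1,100\n2,200\n' > data/raw/survey.csv

# Remove write permission for everyone on the raw data file.
chmod a-w data/raw/survey.csv
ls -l data/raw/survey.csv

# Processing reads the raw file and writes a NEW file.
# awk keeps the first occurrence of each line and preserves the order.
awk '!seen[$0]++' data/raw/survey.csv > data/processed/survey_deduplicated.csv
cat data/processed/survey_deduplicated.csv
```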

Project management

Consistent and careful naming

  • Name folders, files, variables so anyone can understand
  • Long names are better than abbreviations you will forget
  • Figures: scatter_income_by_gender.png not figure_3.png
  • Variables: log_family_income not inc4

Project management

Split code into multiple files by feature

  • Split your code into functions by feature
  • Put these functions into different files
  • Instead of functions.R, maybe you have simulation.R, arithmetic.R, robustness_checks.R, etc.

Project management

Document everything, describe project in README.md file

  • Place a file called README.md in your project root
  • Use this file to describe the project, explain how to run your code, etc.
  • For your future self and those you share the project with.

Project management

Have a backup strategy

  • Make sure your data is backed up
  • Recommended: use a cloud sync service, like OneDrive
  • Do not use Git for backup

Short shell intro


  • macOS and Linux are both Unix-like operating systems and come with Unix shells
  • Git for Windows provides Bash emulation
  • Why?
    • Almost all servers speak shell
    • To know what goes on “under the hood”
    • Automation
    • Reproducibility (again!)
    • Useful tricks for e.g., file manipulation:
find "raw_data" -name "*.csv" -type f -exec \
  perl -i -0pe 's/Vet ej\/\nVill ej svara/Vet ej\/Vill ej svara/g' {} \;

Short shell intro

Git Bash on Windows

Terminal on Mac
  • You can also open a shell from VS Code by going to “Terminal > New Terminal”
  • In VS Code on Windows, the terminal defaults to PowerShell. You can change it to Git Bash in the settings, under “Default profile”.

Short shell intro

Basic shell commands: navigation

# pwd prints working directory
pwd
/Users/adam/Library/CloudStorage/Dropbox/Teaching/Data science for economists/datascience-course/lectures/lecture_2
# cd changes directory
# ".." means up one level
cd ..
pwd
/Users/adam/Library/CloudStorage/Dropbox/Teaching/Data science for economists/datascience-course/lectures
# ls lists files in folder
ls
git.png
git_branch_create.gif
git_branch_local_merge.gif
git_branch_pull_request.gif
git_clone_vscode.gif
git_init_repo.gif
git_merge.gif
git_merge_conflict.png
git_merge_conflict_combination.png
git_merge_conflict_editor.png
git_stage_commit_push.gif
gitignore.png
lecture_2.html
lecture_2.qmd
lecture_2.rmarkdown
man_ls.png
phd_git_comic.gif
project_folder_structure.png
readme.png
shell_mac.png
shell_win.png

Short shell intro

Commands and flags

  • Shell commands often have options (flags)
  • Flags usually start with one or two dashes
  • Example ls:
    • -l = long listing format (permissions, owner, size, date)
    • -a = show all, also hidden
    • -h = human readable file sizes
ls -lah
total 80240
drwxr-xr-x@ 23 adam  staff   736B Mar 31 16:51 .
drwxr-xr-x@ 11 adam  staff   352B Mar 24 12:10 ..
-rw-r--r--@  1 adam  staff    48K Mar  8 17:44 git.png
-rw-r--r--@  1 adam  staff   1.2M Mar 14 11:55 git_branch_create.gif
-rw-r--r--@  1 adam  staff   1.1M Mar 14 11:55 git_branch_local_merge.gif
-rw-r--r--@  1 adam  staff   5.5M Mar 14 11:55 git_branch_pull_request.gif
-rw-r--r--@  1 adam  staff   1.3M Mar 13 14:10 git_clone_vscode.gif
-rw-r--r--@  1 adam  staff   1.0M Mar 13 14:04 git_init_repo.gif
-rw-r--r--@  1 adam  staff   3.4M Mar 13 16:22 git_merge.gif
-rw-r--r--@  1 adam  staff   142K Mar 13 16:35 git_merge_conflict.png
-rw-r--r--@  1 adam  staff    59K Mar 13 16:35 git_merge_conflict_combination.png
-rw-r--r--@  1 adam  staff    58K Mar 13 16:35 git_merge_conflict_editor.png
-rw-r--r--@  1 adam  staff   987K Mar 13 15:34 git_stage_commit_push.gif
-rw-r--r--@  1 adam  staff   9.4K Mar 13 17:37 gitignore.png
-rw-r--r--@  1 adam  staff    24M Mar 31 16:43 lecture_2.html
-rw-r--r--@  1 adam  staff    18K Mar 31 16:50 lecture_2.qmd
-rw-r--r--   1 adam  staff    18K Mar 31 16:51 lecture_2.rmarkdown
-rw-r--r--@  1 adam  staff    49K Mar 23 15:29 man_ls.png
-rw-r--r--@  1 adam  staff   111K Mar  8 17:44 phd_git_comic.gif
-rw-r--r--@  1 adam  staff    17K Mar 13 17:07 project_folder_structure.png
-rw-r--r--@  1 adam  staff   150K Mar 13 17:40 readme.png
-rw-r--r--@  1 adam  staff    12K Mar 14 11:21 shell_mac.png
-rw-r--r--@  1 adam  staff    32K Mar 14 11:21 shell_win.png
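
As a small sketch of how flags combine, the commands below run `ls` in a throwaway directory with two hypothetical files, one of them hidden. Short flags can be given one by one or merged behind a single dash.

```shell
set -e
scratch=$(mktemp -d)
cd "$scratch"
touch visible.txt .hidden.txt   # hypothetical files; names starting with "." are hidden

ls            # shows only visible.txt
ls -a         # also shows ., .. and .hidden.txt
ls -l -a -h   # flags given one by one ...
ls -lah       # ... or combined behind a single dash, same result
```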

Short shell intro

More commands

  • mv <file> <destination> moves
  • cp <file> <destination> copies
  • rm <file> removes
  • mkdir <directory> creates a directory, rmdir removes
  • find searches
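
The commands above can be tried safely in a throwaway directory. This is a minimal sketch; the folder and file names are made up for illustration.

```shell
set -e
scratch=$(mktemp -d)
cd "$scratch"

mkdir drafts archive              # create two directories
echo "notes" > drafts/notes.txt   # create a small file to play with

cp drafts/notes.txt drafts/notes_copy.txt   # copy a file
mv drafts/notes_copy.txt archive/           # move the copy into archive/
rm drafts/notes.txt                         # remove the original (there is no trash bin!)
rmdir drafts                                # rmdir only removes EMPTY directories

# find searches a directory tree; here: all .txt files below the current folder
find . -name "*.txt" -type f
```

Note that `rm` deletes immediately, so it is worth double-checking the path before pressing Enter.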

Short shell intro

man for manual

man ls

 

Use space to browse, press h for help and q to quit.

Short shell intro

  • This was just a very basic introduction
  • Hopefully the sight of a command line will not scare you from now on
  • Check out the resources at the end of today’s talk if you want to learn more!

Version control


Why?

 

Version control

Git + GitHub + VS Code

  • Git: (distributed) version control system
  • File history, track changes, cloud sync, collaboration
  • Works best for code (text files)
  • GitHub is a platform to host Git repositories.
    • Extremely popular in software development
    • Getting popular in academia, great for open science
  • Seamless integration with VS Code

Version control

How does Git work?

  • A project = a Git repository
  • Each user works in local copy of project
  • Labelled versioning with commits
  • Branches enable multi-feature development
  • Changes sent to GitHub on demand (unlike Dropbox)
  • Facilitates collaboration with branching, conflict management, pull requests

Basic Git workflow


Creating your first Git repository

Let’s go through how to set up and work in a simple Git repository:

  1. Initialize a repository on GitHub with a README file
  2. Clone (download) the repository to our computer using VS Code
  3. Make a change to the codebase
  4. Stage the change
  5. Commit the change
  6. Push the change to GitHub

The slides show both screencasts and the equivalent shell commands.

Basic Git workflow

Initialize a repository

 
git init test-repo

Basic Git workflow

Clone a repo using VS Code

 

git clone https://github.com/adam-test-acc/test-repo.git

Basic Git workflow

Stage, commit, and push changes

 
git stage README.md
git commit -m "Readme change"
git push

Follow the instructions in Problem set 0 to set up your Git user.name and user.email. Otherwise you will get an error when pressing “Commit”.

Basic Git workflow

Stage and commit in two steps, why?

  • git commit only commits staged changes
  • We use git add <file> (git stage is a synonym) to choose what to commit
    • You can even use git add --patch <file> to stage only parts of a file
  • This lets you divide your changes into multiple commits based on features/components
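
A minimal sketch of why the two steps are useful: two unrelated edits are staged and committed separately. The repository name, file names, and user identity below are hypothetical and only set inside a throwaway directory.

```shell
set -e
scratch=$(mktemp -d)
cd "$scratch"
git init --quiet demo-repo
cd demo-repo
# Hypothetical identity, set for this repository only.
git config user.name "Demo User"
git config user.email "demo@example.com"

# Two unrelated edits (hypothetical file names).
echo "# demo-repo" > README.md
echo "print('hello')" > analysis.R

# Stage and commit them separately, one commit per feature.
git add README.md
git commit --quiet -m "Add README"
git add analysis.R
git commit --quiet -m "Add analysis script"

git log --oneline
```

The history now has two small commits with descriptive messages instead of one large, mixed commit.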

Basic Git workflow

What we just did

  • Initialized repo on GitHub, cloned to local machine
  • Edited the README file in the repo
  • Staged the edit: told Git we want to record the change we just made
  • Committed (= recorded) the edit into the repo history with a helpful message
  • Pushed the edited repo to the upstream remote (=GitHub).
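
The whole cycle can also be rehearsed offline. In the sketch below a local bare repository stands in for GitHub (an assumption made so the example runs without a network or login), and it starts empty rather than with a README. The identity values are hypothetical.

```shell
set -e
scratch=$(mktemp -d)
cd "$scratch"

# A bare repository plays the role of GitHub in this sketch.
git init --quiet --bare remote.git

# Clone it, as you would clone from GitHub (Git warns that the repo is empty).
git clone --quiet remote.git test-repo
cd test-repo
git config user.name "Demo User"          # hypothetical identity,
git config user.email "demo@example.com"  # set for this repo only

echo "# test-repo" > README.md           # make a change
git add README.md                        # stage it
git commit --quiet -m "Readme change"    # commit it
git push --quiet -u origin HEAD          # push the current branch to the remote

git log --oneline
```

`git push -u origin HEAD` pushes the current branch under the same name and sets it as the upstream, so later calls can use a plain `git push`.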

Basic Git workflow

Credentials

  • Locally, Git uses information stored in user.name and user.email to tell others who you are.
  • When pulling/pushing changes from/to GitHub, VS Code needs your login. Anyone can clone a public repository, but only you (and those you invite) can push changes to your repo.

Basic Git workflow

README=GitHub landing page

  • Should explain purpose
  • Requirements to run code
  • Can also be in subdirectories
  • Should be written in Markdown syntax

Basic Git workflow

.gitignore file

  • What file patterns to ignore
  • You should ignore:
    • Data (especially sensitive!)
    • Very large files
    • Binary files (e.g. pdf)
    • Output files (maybe)

 
  • Remember: Git is not backup. You should have a backup solution too.
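
A minimal sketch of a `.gitignore` for a data project. The patterns and file names are assumptions to adapt to your own layout. `git check-ignore -v` reports which pattern causes a path to be ignored.

```shell
set -e
scratch=$(mktemp -d)
cd "$scratch"
git init --quiet demo-repo
cd demo-repo

# Hypothetical ignore patterns; adjust to your project layout.
cat > .gitignore <<'EOF'
# Data, especially anything sensitive
data/
# Binary and output files
*.pdf
output/
EOF

mkdir -p data/raw code
echo "id,income" > data/raw/survey.csv
echo "# cleaning script" > code/clean_data.R

# Show which pattern (if any) makes Git ignore a path.
git check-ignore -v data/raw/survey.csv

# Only .gitignore and code/ are listed as untracked; data/ is skipped.
git status --short
```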

Merge conflicts


Code change by multiple sources

 

Two contributors have changed the same lines of the same file. Git stops the merge so that neither change is overwritten, and the conflict has to be resolved manually.

Merge conflicts

 

Merge conflicts

Resolving conflicts using the merge editor

 

Merge conflicts

Manual edits

VS Code provides a great UI for resolving merge conflicts, but you can also do it manually by editing the conflicting files. If you open the file you will see something like this. The strange characters are Git’s way of highlighting the merge conflict.

# test-repo
<<<<<<< HEAD
A repo for testing and having fun.
=======
A repo for playing around.
>>>>>>> 814e09178910383c128045ce67a58c9c1df3f558.
A not so cool change to the README file.

You fix the conflict manually by removing the conflict markers (the <<<<<<<, =======, and >>>>>>> lines) and keeping the text you prefer. Then stage and commit the file to complete the merge.
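
To see a conflict without a collaborator, you can provoke one locally. The sketch below is a hypothetical reconstruction in a throwaway repository: two branches change the same line, the merge stops, and the conflict is resolved by hand. The identity, branch name, and commit messages are made up for illustration.

```shell
set -e
scratch=$(mktemp -d)
cd "$scratch"
git init --quiet test-repo
cd test-repo
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"

# Common starting point.
printf '# test-repo\nA repo.\n' > README.md
git add README.md
git commit --quiet -m "Initial README"

# One contributor changes the description on a branch ...
git switch --quiet --create test-branch
printf '# test-repo\nA repo for playing around.\n' > README.md
git commit --quiet -am "Describe repo as playground"

# ... while another changes the same line on the original branch.
git switch --quiet -    # "-" means: the branch we were on before
printf '# test-repo\nA repo for testing and having fun.\n' > README.md
git commit --quiet -am "Describe repo as test bed"

# The merge stops with a conflict (non-zero exit status, hence "|| true").
git merge test-branch || true
cat README.md           # shows the conflict markers

# Resolve: write the text you want, then stage and commit to finish the merge.
printf '# test-repo\nA repo for testing, playing around, and having fun.\n' > README.md
git add README.md
git commit --quiet --no-edit
git log --oneline
```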

Branches


  • Want to test a large change, but unsure it will work?
  • Create a new branch to try it out, then just switch back to main if it fails.
  • Keep track of your changes (with commits) without disturbing your collaborators with unfinished code.
  • If you are happy with your changes, merge them to the main branch.
  • Branching is awesome, use it!

Branches

Creating a new branch in VS Code

 
git switch --create test-branch

Branches

Merging branches locally

 
git switch main
git merge test-branch
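
The full branch cycle can be tried in a throwaway repository. In this sketch the file name, identity, and commit messages are hypothetical, and the starting branch is read from Git rather than hard-coded, because its name depends on your setup.

```shell
set -e
scratch=$(mktemp -d)
cd "$scratch"
git init --quiet test-repo
cd test-repo
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"

echo "# test-repo" > README.md
git add README.md
git commit --quiet -m "Initial README"
# Name of the branch we start on ("main" or "master", depending on your Git setup).
start_branch=$(git branch --show-current)

# Try out a change on a separate branch.
git switch --quiet --create test-branch
echo "x <- 1" > experiment.R
git add experiment.R
git commit --quiet -m "Add experiment script"

# Happy with it: switch back and merge.
git switch --quiet "$start_branch"
git merge --quiet test-branch

# The branch is merged and can be deleted.
git branch --delete test-branch
git log --oneline
```

If the experiment had failed, switching back without merging would leave the starting branch untouched.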

Branches

Merging branches with pull requests

 

Git advice

  • Commit often, work in small features
    • Don’t: “New data processing script”
    • Do: “Removed duplicates from survey data”
  • Push changes when you want your collaborators to see
  • Git is not backup!
  • Branches are useful even when solo

When all else fails… 🤷

 

Next lecture: R basics

Resources

Extras: SSH

SSH

SSH=Secure shell

  • Logs in to a remote server over an encrypted channel
  • Really useful but setup requires some work
  • Uses public key cryptography
    • Private key - stored on your computer
    • Public key - stored on the server
    • When connecting, your private key signs a message that the server verifies with the corresponding public key

SSH

Generating a key pair

To generate an SSH key pair and register the public key with GitHub, follow these steps. For detailed instructions, see GitHub Docs.

ssh-keygen -t ed25519 -C "your_email@example.com"

Press Enter to accept the default file location. When prompted, enter a secure passphrase (your terminal will not show characters as you type).

SSH

Storing the passphrase using ssh-agent

You should secure your keys with a passphrase. But having to type the passphrase each time you use the SSH key can become annoying. Instead, you can use ssh-agent, which keeps the unlocked key in memory so that you only enter the passphrase once per session. On Mac/Linux this is easy:

eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519

On Windows, it’s more complicated. I’ve included the instructions in Problem Set 0 but let’s go through them together now.

SSH

Connecting to GitHub over SSH

To use SSH with GitHub you first need to add your public key to your GitHub profile. Go to settings and “SSH and GPG keys” to add it.

Afterwards, run:

ssh -T git@github.com
Hi adamaltmejd! You've successfully authenticated, but GitHub does not provide shell access.

Now you can clone repositories using SSH rather than HTTPS.
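
A repository that was already cloned over HTTPS does not need to be cloned again. As a sketch, the commands below point an existing remote at the SSH address instead. They run in a throwaway repository and reuse the test-repo address from earlier in the lecture purely as an example. Nothing is contacted over the network.

```shell
set -e
scratch=$(mktemp -d)
cd "$scratch"
git init --quiet test-repo
cd test-repo

# A repository cloned over HTTPS has a remote like this one.
git remote add origin https://github.com/adam-test-acc/test-repo.git

# Point the same remote at the SSH address (form: git@github.com:OWNER/REPO.git).
git remote set-url origin git@github.com:adam-test-acc/test-repo.git

# Show the fetch and push URLs now in use.
git remote -v
```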

We will get back to how to use SSH to connect to servers later in the course.